Support DataFrame Interchange Protocol (allow Polars DataFrames) #2888

ChristopherDavisUCI · 2023-02-15T00:14:17Z

Based on the discussion in #2868, this is an attempt to provide modest support for specifying data as a Polars DataFrame. We do not convert to a pandas DataFrame, so at this point, the encoding type (e.g., "Q", "N", ...) needs to be specified explicitly. (My reasoning is that it would be easier to add type inference later than to take it away.)

Example:

import polars as pl
import altair as alt

df = pl.DataFrame(
    {
        "A": [1, 2, 3, 4, 5],
        "cars": ["beetle", "audi", "beetle", "beetle", "beetle"],
        "optional": [28, 300, None, 2, -30],
    }
)

alt.Chart(df).mark_bar().encode(
    x="A:O",
    y="optional:Q",
    color="cars:N"
)
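
Because no type inference happens for a Polars DataFrame at this stage, the type code after the colon in each shorthand string is what tells Altair how to treat the field. A minimal sketch of that split follows. `parse_shorthand_minimal` and `TYPE_CODES` are hypothetical names used only for illustration; they cover four type codes and are not Altair's actual parser.

```python
# Hypothetical illustration only; Altair's real shorthand parser handles more cases.
TYPE_CODES = {
    "Q": "quantitative",
    "N": "nominal",
    "O": "ordinal",
    "T": "temporal",
}


def parse_shorthand_minimal(shorthand):
    """Split 'field:CODE' into a field name and a full type name."""
    field, sep, code = shorthand.rpartition(":")
    # Without a recognised type code there is nothing to fall back on,
    # since the type is not inferred from the data in this sketch.
    if not sep or code not in TYPE_CODES:
        raise ValueError(f"explicit encoding type required, got {shorthand!r}")
    return {"field": field, "type": TYPE_CODES[code]}


for spec in ["A:O", "optional:Q", "cars:N"]:
    print(parse_shorthand_minimal(spec))
```

A bare column name such as "A" raises a ValueError in this sketch, which mirrors the requirement that the type be spelled out.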

mattijn · 2023-02-15T19:04:29Z

I'll review this later, it looks good, but I'm not against being a bit more experimental here.

Maybe we can explore using the dataframe protocol, https://data-apis.org/dataframe-protocol/latest/index.html. I know pyarrow, polars and pandas already support this through .__dataframe__().

I tried a few things, see here: https://gist.github.com/mattijn/45752432a65e1018512305d0ca228d40.
I was somehow hoping that __dataframe__() returns an Arrow-serialisable object, but I'm not sure this is the case; see also apache/arrow#33986 (comment) and pola-rs/polars#3727 (comment).

A fallback would be to serialise it into an IPC stream / feather byte array and pass this into the Vega-Lite spec, or into the HTML_TEMPLATE as a var object.
I could not get this part to work, but it should be similar to https://observablehq.com/@vega/vega-lite-and-apache-arrow-no-plugin (more info: https://github.com/vega/vega-loader-arrow#browser-use).

mattijn · 2023-02-15T23:28:12Z

I noticed a few issues that I could not address through inline suggestions. Have a look at commit 1fbb7c1 to see what I changed.

I introduced a check for the dataframe protocol using hasattr(data, "__dataframe__"). I then check whether polars is in the __module__ name, but if we do this right, the protocol check becomes the library-agnostic part and the polars check can be removed.
I made sure that all of this happens after the check for pandas.DataFrame, so that we can experiment without touching the current behaviour for pandas DataFrames.
There is currently no sanitization of the polars dataframe; we write it directly using .to_dicts(). Up to that point everything is a dictionary, so there is no need to write it to row-oriented JSON.
I extended the check_data_type and limit_rows functions with support for the __dataframe__ protocol.
I also changed the order in _prepare_data so that the conversion runs before _consolidate_data, which moves the inline data to the top level under a unique name.
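
The ordering described above (pandas first, then the protocol check) can be sketched roughly as follows. `select_branch` and `FakeFrame` are hypothetical names for illustration. The sketch inspects the module name instead of importing pandas so that it stays self-contained; that is an assumption of this example, not a description of the code in the commit.

```python
def select_branch(data):
    """Hypothetical helper: report which code path would handle `data`."""
    module_root = type(data).__module__.split(".")[0]
    # pandas is checked first so its existing behaviour is left untouched.
    if module_root == "pandas":
        return "pandas"
    # Any object exposing __dataframe__ takes the library-agnostic path.
    if hasattr(data, "__dataframe__"):
        return "interchange"
    if isinstance(data, dict):
        return "dict"
    return "unsupported"


class FakeFrame:
    """Stand-in that merely exposes the protocol's entry point."""

    def __dataframe__(self, nan_as_null=False, allow_copy=True):
        raise NotImplementedError("illustration only")


print(select_branch(FakeFrame()))
print(select_branch({"values": []}))
print(select_branch(42))
```

Note that `hasattr` only checks that the attribute exists; it does not call `__dataframe__`, so the check itself is cheap.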

Now the following example:

import polars as pl
import pandas as pd
import altair as alt


df_pl = pl.DataFrame(
    {
        "A": [9, 8, 7, 6, 5],
        "cars": ["beetle", "audi", "beetle", "beetle", "beetle"],
        "optional": [28, 300, None, 2, -30],
    }
)

df_pd = pd.DataFrame(
    {
        "A": [1, 2, 3, 4, 5],
        "cars": ["beetle", "audi", "beetle", "beetle", "beetle"],
        "optional": [28, 300, None, 2, -30],
    }
)

c1 = alt.Chart(df_pl, height=20, title='polars.DataFrame').mark_bar().encode(
    x="A:O",
    y="optional:Q",
    color="cars:N"
)

c2 = alt.Chart(df_pd, height=20, title='pandas.DataFrame').mark_bar().encode(
    x="A:O",
    y="optional:Q",
    color="cars:N"
)

print(alt.vconcat(c1, c2).to_json())

returns:

{
  "$schema": "https://vega.github.io/schema/vega-lite/v5.2.0.json",
  "config": {
    "view": {
      "continuousHeight": 300,
      "continuousWidth": 300
    }
  },
  "datasets": {
    "data-90855424d1d3c3df2b450ef6e4564242": [
      {
        "A": 9,
        "cars": "beetle",
        "optional": 28
      },
      {
        "A": 8,
        "cars": "audi",
        "optional": 300
      },
      {
        "A": 7,
        "cars": "beetle",
        "optional": null
      },
      {
        "A": 6,
        "cars": "beetle",
        "optional": 2
      },
      {
        "A": 5,
        "cars": "beetle",
        "optional": -30
      }
    ],
    "data-faf1e5382ce70f32dc6c22613bf3493d": [
      {
        "A": 1,
        "cars": "beetle",
        "optional": 28.0
      },
      {
        "A": 2,
        "cars": "audi",
        "optional": 300.0
      },
      {
        "A": 3,
        "cars": "beetle",
        "optional": null
      },
      {
        "A": 4,
        "cars": "beetle",
        "optional": 2.0
      },
      {
        "A": 5,
        "cars": "beetle",
        "optional": -30.0
      }
    ]
  },
  "vconcat": [
    {
      "data": {
        "name": "data-90855424d1d3c3df2b450ef6e4564242"
      },
      "encoding": {
        "color": {
          "field": "cars",
          "type": "nominal"
        },
        "x": {
          "field": "A",
          "type": "ordinal"
        },
        "y": {
          "field": "optional",
          "type": "quantitative"
        }
      },
      "height": 20,
      "mark": {
        "type": "bar"
      },
      "title": "polars.DataFrame"
    },
    {
      "data": {
        "name": "data-faf1e5382ce70f32dc6c22613bf3493d"
      },
      "encoding": {
        "color": {
          "field": "cars",
          "type": "nominal"
        },
        "x": {
          "field": "A",
          "type": "ordinal"
        },
        "y": {
          "field": "optional",
          "type": "quantitative"
        }
      },
      "height": 20,
      "mark": {
        "type": "bar"
      },
      "title": "pandas.DataFrame"
    }
  ]
}
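
The "data-..." keys in the output are the unique names given to the consolidated datasets. One way such a name could be derived is from a hash of the serialized records, sketched below. `dataset_name` is a hypothetical helper, and the use of an MD5 digest over key-sorted JSON is an assumption of this sketch; the digests it produces are not claimed to match the ones shown above.

```python
import hashlib
import json


def dataset_name(values):
    """Hypothetical helper: derive a stable name from a list of records."""
    # Sorting keys makes the name independent of dict insertion order.
    values_json = json.dumps(values, sort_keys=True)
    return "data-" + hashlib.md5(values_json.encode()).hexdigest()


polars_like = [{"A": 9, "cars": "beetle", "optional": 28}]
pandas_like = [{"A": 1, "cars": "beetle", "optional": 28.0}]

print(dataset_name(polars_like))
print(dataset_name(pandas_like))
```

Identical records always map to the same name, so a dataset shared by several charts would only need to be stored once under this scheme.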

mattijn · 2023-02-16T07:26:19Z

@ChristopherDavisUCI, can you extend this PR so that the experimental polars support also covers the following two functions?

https://github.com/altair-viz/altair/blob/45bbbb7398e68e6c696d3af6cbfcb16addb6c803/altair/utils/data.py#L189-L190
and
https://github.com/altair-viz/altair/blob/45bbbb7398e68e6c696d3af6cbfcb16addb6c803/altair/utils/data.py#L210-L211

mattijn · 2023-02-16T13:20:38Z

I just saw the recent merges of apache/arrow#14804 and pola-rs/polars#6581. Based on this I could get this to work:

import pyarrow as pa
import pyarrow.interchange as pi
import polars as pl
import pandas as pd

data = {'a': [1, 2, 3], 'b': [4, 5, 6]}
pa_table = pa.table(data)
pl_df = pl.DataFrame(data)
pd_df = pd.DataFrame(data)

interchange_pyarrow = pa_table.__dataframe__()
interchange_polars = pl_df.__dataframe__()
interchange_pandas = pd_df.__dataframe__()

interchange_pyarrow2table = pi.from_dataframe(interchange_pyarrow)
interchange_polars2table = pi.from_dataframe(interchange_polars)
interchange_pandas2table = pi.from_dataframe(interchange_pandas)

print(interchange_pyarrow2table.to_pylist() == interchange_polars2table.to_pylist() == interchange_pandas2table.to_pylist())

interchange_pyarrow2table.to_pylist()

True
[{'a': 1, 'b': 4}, {'a': 2, 'b': 5}, {'a': 3, 'b': 6}]

For now we can do the serialization to a pylist-style list of records on the Python side, but in the future we could transfer the buffer within the Vega-Lite specification or HTML template and do the serialization on the JavaScript side.

This means that we can support the dataframe protocol with a single soft dependency on pyarrow>=11.0.0.
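
To make the "pylist style" concrete: it is a row-oriented list of dictionaries, one per record, which serializes directly to JSON. The pure-Python sketch below reproduces that shape from column-oriented input. `columns_to_records` is a hypothetical stand-in for illustration and says nothing about how pyarrow implements `to_pylist()`.

```python
import json


def columns_to_records(columns):
    """Hypothetical helper: turn {name: [values]} into a list of row dicts."""
    names = list(columns)
    # zip(*...) walks the columns in lockstep, yielding one tuple per row.
    return [dict(zip(names, row)) for row in zip(*columns.values())]


data = {"a": [1, 2, 3], "b": [4, 5, 6]}
records = columns_to_records(data)
print(records)
# prints [{'a': 1, 'b': 4}, {'a': 2, 'b': 5}, {'a': 3, 'b': 6}]

# The records are plain Python objects, so they can go straight into JSON.
print(json.dumps(records))
```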

ChristopherDavisUCI · 2023-02-16T15:04:48Z

Thanks for all these improvements @mattijn, it looks very promising! I don't think I'll have a chance to look closely before Saturday, but I will go through this over the weekend.

mattijn · 2023-02-16T20:36:39Z

Updated PR to use the dataframe protocol in combination with pyarrow.interchange.

import altair as alt
import pyarrow as pa
import polars as pl
import pandas as pd
import vaex


def chart(source, title):
    return (
        alt.Chart(source, height=20, title=title)
        .mark_bar()
        .encode(x="x:O", y="y:Q")
    )


data = {"x": [1, 2, 3], "y": [4, 5, 6]}
pa_table = pa.table(data)
df_polars = pl.DataFrame(data)
df_pandas = pd.DataFrame(data)
df_vaex = vaex.from_pandas(df_pandas)

dataframes = {
    "pyarrow": pa_table,
    "polars": df_polars,
    "pandas": df_pandas,
    "vaex": df_vaex,
}
alt.hconcat(*[chart(dataframes[df], df) for df in dataframes])

I think this is good to go.

joelostblom · 2023-02-17T22:12:04Z

This is cool! Could we add a note in the changelog saying that Altair now has basic support for all data frame libraries that support the __dataframe__ exchange protocol? I don't know how we would create tests for this without requiring all the df libraries to be in the dev requirements, so maybe let's skip that and just mention that this is rudimentary and might not pick up on all types of data like ordinal correctly (if that is true)?
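
Should tests be wanted later without adding every library to the dev requirements, one possible approach is to exercise only the libraries that happen to be installed. The sketch below uses the standard library to detect them. `installed_libraries` and the `CANDIDATES` list are hypothetical and are not part of this PR.

```python
import importlib.util

# Hypothetical list of top-level package names a test suite might try.
CANDIDATES = ["pyarrow", "polars", "pandas", "vaex"]


def installed_libraries(names):
    """Return the subset of top-level packages that can be imported here."""
    # find_spec returns None for a missing top-level package without importing it.
    return [name for name in names if importlib.util.find_spec(name) is not None]


print(installed_libraries(CANDIDATES))
```

A test could then loop over the returned names and skip the rest, so the suite still passes in an environment where none of them are present.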

mattijn · 2023-02-18T10:36:05Z

Since @ChristopherDavisUCI would like to go through this over the weekend, I'll wait for his approval or suggestions before merging.

ChristopherDavisUCI · 2023-02-18T18:35:35Z

This seems great to me @mattijn, much more ambitious than what I started with!

Is pyarrow a dependency for Altair now, or only for using these new data sources? Do you see any downside to that? (When I tried my code above with the new updates, I got an error that I needed to install pyarrow. It worked fine after I installed pyarrow.) Am I right in understanding that pyarrow is a much more lightweight requirement than something like Polars? (I see pandas itself is a dependency of pyarrow.)

I trust your and @joelostblom's intuition, so good to merge from my perspective! Is there anything in particular you'd like me to try out? (I think your requests from #2888 (comment) are no longer relevant, because you got them implemented yourself, right?)

mattijn · 2023-02-18T19:23:53Z

Thanks for the comment @ChristopherDavisUCI! I made some changes to the error messages in the latest commit: e0cda9e.
If pyarrow is not installed, it will now say:

"Usage of the DataFrame Interchange Protocol requires the package 'pyarrow', but it is not installed."

And if the installed version of pyarrow is too low:

"The installed version of 'pyarrow' does not meet the minimum requirement of version 11.0.0. "
"Please update 'pyarrow' to use the DataFrame Interchange Protocol."

pyarrow itself is not a hard dependency of altair, but it is required as a soft dependency to access the DataFrame Interchange Protocol.
I see what you mean with your question regarding pyarrow vs polars. Polars supports reading of the DataFrame Interchange Protocol through pyarrow, so you'll need both if you read the dataframe using polars functionality.

I would be surprised if pandas is a dependency of pyarrow. Where did you see that? I can't see it in here: https://github.com/apache/arrow/tree/main/python (only in the test-requirements).

mattijn · 2023-02-18T19:29:04Z

To add: if this works out well and we can sanitize through pyarrow tables and do proper type checking of the fields, then eventually we can replace pandas with pyarrow. At that point pyarrow would be a hard dependency and pandas would no longer be a dependency. But currently it is just experimental.

ChristopherDavisUCI · 2023-02-18T19:41:29Z

> I would be surprised if pandas is a dependency of pyarrow. Where did you see that?

You're right, I misremembered. pandas is listed here, but only as an optional dependency: https://arrow.apache.org/docs/python/install.html

Good to merge from my perspective!

Allow Polars DataFrames

f8d8552

ChristopherDavisUCI mentioned this pull request Feb 15, 2023

Support for pola.rs DataFrames #2868

Closed

3 tasks

black formatting

492c592

code review suggestions

1fbb7c1

improve code comment

45bbbb7

support dataframe interchange format

e02b646

linting

8db81cd

mattijn changed the title ~~Allow Polars DataFrames~~ Support dataframe protocol (allow Polars DataFrames) Feb 16, 2023

mattijn changed the title ~~Support dataframe protocol (allow Polars DataFrames)~~ Support DataFrame Interchange Protocol (allow Polars DataFrames) Feb 16, 2023

mattijn added 3 commits February 18, 2023 11:15

Merge branch 'master' into polars

089ec08

include info in release notes and data section

75e931f

fix typos in sentece

17f51f9

improve error messages

e0cda9e

mattijn mentioned this pull request Feb 18, 2023

create an extra_require group for soft dependencies #2818

Closed

Merge branch 'master' into polars

38330aa

mattijn merged commit 22a16e3 into vega:master Feb 18, 2023

joelostblom mentioned this pull request Mar 17, 2023

Note that polars work with Altair kevinheavey/modern-polars#15

Merged

ChristopherDavisUCI deleted the polars branch April 23, 2023 12:11

astrojuanlu mentioned this pull request Jun 16, 2023

Release the spec as PyPI package data-apis/dataframe-api#73

Open

ivirshup mentioned this pull request Aug 29, 2023

Idea: __dataframe__ interchange protocol for anndata scverse/anndata#1111

Open

labanyamukhopadhyay mentioned this pull request Aug 31, 2023

BUG: Altair incompatible with Modin modin-project/modin#5438

Closed

3 tasks

ivirshup mentioned this pull request Sep 6, 2023

Support for polars dataframes holoviz/datashader#1199

Open

mattijn mentioned this pull request Jun 25, 2024

wip-feat: pandas as soft dependency #3384

Closed
